
eth/downloader: drop beacon head updates if the syncer is restarting #27397

Merged: 2 commits merged into ethereum:master on Jun 5, 2023

Conversation

karalabe (Member) commented on May 31, 2023

Fixes #27386.

When a reorg happens, the beacon/skeleton syncer needs to stop, swap out its internal state to the new head and restart. If the downloader is busy importing blocks in the meantime, it will prevent the beacon syncer from completing its restart loop until all queued blocks are consumed.

During this "hang", any newly arriving newPayload or forkchoiceUpdated messages will get blocked, as they cannot feed their respective new data to the beacon syncer. If the "hang" lasts 15 minutes, these update messages will keep piling up, consuming RAM and goroutines.
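For intuition, here is a tiny self-contained Go sketch (illustrative only, not geth code) of that failure mode: every sender blocked on an unbuffered channel keeps a goroutine and its stack alive while the consumer is busy.

	package main

	import (
		"fmt"
		"runtime"
		"time"
	)

	func main() {
		updates := make(chan int) // unbuffered: every send blocks until consumed

		// The consumer is stuck doing other work (analogous to the downloader
		// importing queued blocks), so nothing reads from the channel yet.
		go func() {
			time.Sleep(5 * time.Second)
			for range updates {
			}
		}()

		// Every incoming "update" spawns a goroutine that blocks on the send,
		// so goroutines (and their stacks) keep accumulating.
		for i := 0; i < 1000; i++ {
			go func(n int) { updates <- n }(i)
		}

		time.Sleep(time.Second)
		fmt.Println("goroutines piled up:", runtime.NumGoroutine())
	}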

The PR "fixes" this issue by dropping update messages on the floor while in this limbo state. This should unblock engine API events and prevent bloat.

That said, it does introduce a wonky behavior: if a newPayload arrives back to back after a forkchoiceUpdated that triggered a reorg, that newPayload might get ignored. This isn't that big of a deal for a syncing node, but if the node is already in sync, it will "miss" a block and will need to wait for the next one to come back into sync. If the entire network does this however, will it ever come back into sync? Not clear, seems a bit unsafe.

[Screenshot attached: 2023-06-05 09:54:39]

karalabe added this to the 1.12.1 milestone on May 31, 2023
holiman (Contributor) commented on May 31, 2023

The PR "fixes" this issue by dropping update messages on the floor while in this limbo state. This should unblock engine API events and prevent bloat.

So if we have this flow of events piling up: h1, h2, h3, h4, h5, h6. In your solution, as I understand it, if the blocking happens, then h2 .. h6 are dropped on the floor, and h1 is the only one accepted. Yes?

Seems to me that a better solution would be to keep the best/latest update and discard the earlier ones, even if the internals are busy. So that h1, ... h5 are dropped on the floor, and h6 is eventually processed.

So basically:

  • Try to import header h.
  • Can we do that now, or is the lock held?
  • OK, lock held, an import is in progress.
  • Compare h and possibly swap h for latestUnprocessedFCUHeader.

Once the API is done with whatever it was doing, it takes the latestUnprocessedFCUHeader and processes it.

How about something like that?
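For illustration, a minimal sketch of that keep-the-latest idea, assuming a hypothetical syncerSketch type with a busy lock and a latestUnprocessedFCUHeader slot (names follow the wording above, not the actual downloader API):

	package sketch

	import (
		"sync"

		"github.com/ethereum/go-ethereum/core/types"
	)

	// syncerSketch is a hypothetical stand-in for the beacon syncer: busy is
	// held while an import/restart is in progress, mu guards the pending slot.
	type syncerSketch struct {
		busy sync.Mutex
		mu   sync.Mutex

		latestUnprocessedFCUHeader *types.Header
	}

	// tryImportHeader attempts to import a forkchoice-updated header. If the
	// syncer is busy, it only remembers the newest header seen so far instead
	// of blocking the caller.
	func (s *syncerSketch) tryImportHeader(h *types.Header) {
		if s.busy.TryLock() {
			defer s.busy.Unlock()
			s.importHeader(h)
			s.drainPending()
			return
		}
		// Lock held, an import is in progress: compare and possibly swap the
		// pending header, discarding any older one.
		s.mu.Lock()
		if s.latestUnprocessedFCUHeader == nil ||
			h.Number.Cmp(s.latestUnprocessedFCUHeader.Number) > 0 {
			s.latestUnprocessedFCUHeader = h
		}
		s.mu.Unlock()
	}

	// drainPending processes whatever header was stashed while the import was
	// busy.
	func (s *syncerSketch) drainPending() {
		s.mu.Lock()
		pending := s.latestUnprocessedFCUHeader
		s.latestUnprocessedFCUHeader = nil
		s.mu.Unlock()
		if pending != nil {
			s.importHeader(pending)
		}
	}

	func (s *syncerSketch) importHeader(h *types.Header) {
		// The real syncer would restart/extend the skeleton sync here.
	}

Note the sketch still has a window where a header stashed right after drainPending runs is only picked up by the next import; it is meant to show the compare-and-swap shape, not to be race-free.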

karalabe (Member, Author) commented on Jun 1, 2023

Not sure it matters. If there's a header gap, you need to do a sync run afterwards. Whether you keep the last header or drop them all and wait for the next update is kind of similar, though I can see your version being a bit more stable. All in all, I'm unhappy about both.

That said, the moment the thing locks up, even if we block all goroutines somewhere, the order of insertion after that will be random, so it's going to get wonky anyway.

I guess a possibly better solution would be to have a queue of headers that arrived whilst we were offline and try to consume it when the syncer is restarted. That way we could at least preserve the inbound order and not lose any headers.
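A rough sketch of that queue idea (purely illustrative, with made-up names; not what the PR ended up doing), assuming the syncer can be marked offline and later replay buffered updates in arrival order:

	package sketch

	import "sync"

	// headEvent is a hypothetical stand-in for a newPayload/forkchoiceUpdated
	// notification delivered to the syncer.
	type headEvent struct {
		number uint64
	}

	// replayQueue buffers head events that arrive while the syncer is suspended
	// and replays them in inbound order once it restarts.
	type replayQueue struct {
		mu      sync.Mutex
		offline bool
		pending []headEvent
	}

	// push delivers the event immediately if the syncer is online, otherwise it
	// is appended to the replay buffer so nothing is lost.
	func (q *replayQueue) push(ev headEvent, deliver func(headEvent)) {
		q.mu.Lock()
		if q.offline {
			q.pending = append(q.pending, ev)
			q.mu.Unlock()
			return
		}
		q.mu.Unlock()
		deliver(ev)
	}

	// restart marks the syncer online again and replays everything buffered
	// while it was restarting, preserving arrival order.
	func (q *replayQueue) restart(deliver func(headEvent)) {
		q.mu.Lock()
		buffered := q.pending
		q.pending = nil
		q.offline = false
		q.mu.Unlock()
		for _, ev := range buffered {
			deliver(ev)
		}
	}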

karalabe (Member, Author) commented on Jun 1, 2023

Ah, I think I can make this more elegant. I'll make the suspend run on its own goroutine instead of the syncer thread, and have the syncer keep eating up events in between.

holiman (Contributor) left a comment


The change LGTM. Minor note, though: it "looks" more complex than it actually is, in my humble opinion... For me at least, it would look simpler if you had moved the drainer into a new goroutine and left the other one where it was. Like this:

	defer func() {
		done := make(chan struct{})
		go func() {
			// Keep draining the headEvents, to not block sender (updates
			// coming in via the beacon RPC api).
			for {
				select {
				case <-done:
					return
				case event := <-s.headEvents:
					event.errc <- errors.New("beacon syncer reorging")
				}
			}
		}()
		if filled := s.filler.suspend(); filled != nil {
			// If something was filled, try to delete stale sync helpers. If
			// unsuccessful, warn the user, but not much else we can do (it's
			// a programming error, just let users report an issue and don't
			// choke in the meantime).
			if err := s.cleanStales(filled); err != nil {
				log.Error("Failed to clean stale beacon headers", "err", err)
			}
		}
		close(done)
	}()

Do with it as you see fit :)

karalabe merged commit a7b2106 into ethereum:master on Jun 5, 2023
devopsbo3 pushed a commit to HorizenOfficial/go-ethereum that referenced this pull request on Nov 10, 2023:

* eth/downloader: drop beacon head updates if the syncer is restarting

* eth/donwloader: v2 of the goroutine spike preventer
Development

Successfully merging this pull request may close these issues.

Spike in goroutines when CL blocked on setting new head